Machine Learning in Finance
Module 3
Density-based Clustering
Hierarchical Clustering
Start with each point as its own cluster
Merge the two nearest clusters
Keep merging until everything belongs to one cluster
Does not require specifying the number of clusters in advance
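The merge-until-one procedure above can be sketched with scipy's agglomerative clustering on made-up 2D data (an illustration with my own variable names and data; the course labs themselves use R):

```python
# Agglomerative (hierarchical) clustering sketch on made-up 2D data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated groups of 10 points each
pts = np.vstack([rng.normal(0.0, 0.2, size=(10, 2)),
                 rng.normal(3.0, 0.2, size=(10, 2))])

# Each point starts as its own cluster; the nearest clusters are merged
# repeatedly until one cluster remains (the dendrogram records the merges)
merges = linkage(pts, method="ward")

# Cut the dendrogram where merge distances become "large": here, 2 clusters
labels = fcluster(merges, t=2, criterion="maxclust")
```

Cutting with `criterion="maxclust"` is the programmatic version of drawing the horizontal line across the dendrogram.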
Hierarchical Clustering
Dendrogram
Choosing Number of Clusters
The length of the dendrogram lines represents the distance between the merged clusters
Rule of thumb: draw a horizontal line that crosses “large” distances
Examples
Density-based clustering works well for perceptual data:
Principal Component Analysis
PCA
A very popular algorithm in Machine Learning and Finance.
A statistical method to reduce dimensionality
- while retaining as much information (or variance) as possible
Main Idea
Dimensionality reduction is like taking a photo
- Reduce dimensions: 4D (?) -> 2D
- Make sure everybody is visible (keep maximum information)
- Getting the right “angle” is important
Another example
Reduced dimensions can make (huge) distortions when not done properly
PCA
Formally PCA is:
A linear transformation of the original p variables into a new set of q variables (q <= p) such that
- New variables (principal components) are uncorrelated
- The first principal component has the maximum amount of variance
- then second, third, …
Use of PCA
The main use of PCA is dimensionality reduction
- Powerful when many of the features are redundant
- i.e., a feature-reduction algorithm
- Usually combined with other algorithms
- As a preprocessing in ML
- With K-means: bypass curse of dimensionality
- PCR: Principal Component Regression
More Technically: PCA
An orthogonal projection of the data into a lower-dimensional space (e.g., 2D -> 1D)
that minimizes the distance from the original points (red) to their projections (green)
that retains the maximum variance between the projected data points
Projection
Note what happens when the red dots are:
- Spread out the most (maximum variance): best case
- Closely packed (minimum variance): maximum distortion (worst case)
Why maximum variance?
Note that the dots are all “distinct”:
To keep them distinguishable in the lower-dimensional space, they must be spread out as much as possible.
Conversely, if many dots are crammed together and overlap, we can no longer distinguish them in the lower-dimensional space.
Even more Technical: Computation behind PCA
- Eigen Decomposition (EVD)
An eigenvector gives a new direction (axis)
An eigenvalue is similar to the length of its eigenvector:
the information content (or importance)
it is the variance of the corresponding principal component
Generally faster
- Singular Value Decomposition (SVD)
Works on \(m \times n\) matrix
Scales well and is numerically stable because it does not require computing the covariance matrix
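A numpy sketch of why the two routes agree (my own illustration, not course code): the right singular vectors of the standardized data \(Z\) are the eigenvectors of \(G\), and the eigenvalues equal the squared singular values divided by \(n-1\).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = Z.shape[0]

# SVD route: decompose Z directly, no covariance matrix formed
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
eigvals_svd = S**2 / (n - 1)                 # variances of the PCs

# EVD route: eigenvalues of the Gram matrix, sorted descending
G = Z.T @ Z / (n - 1)
eigvals_evd = np.sort(np.linalg.eigvalsh(G))[::-1]
```

The two eigenvalue vectors coincide, which is why libraries can freely choose either decomposition internally.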
PCA Algorithm (EVD)
- Standardize the data with \(p\) variables: \(X_{n \times p} \rightarrow Z_{n \times p}\)
- Compute the Gram matrix (or covariance matrix): \(G_{p \times p} = \frac{1}{n-1} Z^T Z\).
- Perform Eigenvalue Decomposition on \(G\): \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]
- \(V\): Matrix of eigenvectors (new directions, principal components).
- \(\Lambda\): Diagonal matrix with eigenvalues (variances explained, in descending order).
- Compute the principal components by projecting \(Z\) onto \(V\):
\[PC_{n \times p} = Z V_{p \times p}\]
- Variance Explained
- \(\Lambda\) contains the eigenvalues (variance explained by the \(i\)-th principal component).
- Percentage explained by the \(i\)-th component: \[\frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}\]
- Factor Loadings \[L_{p \times p} = V_{p \times p} \sqrt{\Lambda_{p \times p}}\]
- \(L\): Matrix of factor loadings, where \(L_{ij}\) is the loading of variable \(i\) on PC \(j\).
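The four steps above can be sketched end-to-end in numpy (an illustrative translation with random data and my own variable names; the course code is in R):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))               # n = 100 observations, p = 3

# Step 1: standardize (z-scores)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: Gram (covariance) matrix G = Z^T Z / (n - 1)
G = Z.T @ Z / (Z.shape[0] - 1)

# Step 3: eigen decomposition G = V Lambda V^T (eigh: G is symmetric)
eigvals, V = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1]           # sort by descending variance
eigvals, V = eigvals[order], V[:, order]

# Step 4: principal components PC = Z V
PC = Z @ V

# Variance explained by each component
var_explained = eigvals / eigvals.sum()

# Factor loadings L = V sqrt(Lambda)
loadings = V * np.sqrt(eigvals)
```

Note that the covariance matrix of `PC` is exactly `diag(eigvals)`: the components are uncorrelated, with variances in descending order.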
Factor Loadings
Factor loadings are weighted eigenvectors. Needed for Interpretable Machine Learning.
\[L_{p \times p} = V_{p \times p} \sqrt{\Lambda_{p \times p}}\]
\(L_{ij}\) is the loading of variable \(i\) on PC \(j\).
Interpretation:
- High absolute values in \(L\) mean a variable strongly influences that PC.
- Positive/negative signs show the direction of the relationship.
Takeaways of PCA
Suppose you have a dataset with 1,000 observations and 200 variables:
Q1. How many principal components (PCs) will I have?
- It will still give you 200 PCs (unless you specify fewer)
Q2. What does the PCA output look like?
- The same dimensions with changed numbers: 1,000 observations and 200 variables.
Q3. How can it reduce the variables, then?
- From the PC output, drop as many of the rightmost columns as you want.
Q4. How many PCs should I choose?
- It depends, but using variance explained makes the most sense.
Q5. Which PC is the most important?
- The first column. It’s arranged automatically.
Q6. What do you mean most important?
- Most variance, or information content.
Q7. Is that importance same as Eigenvalue?
- That is correct. Each column has its own eigenvalue.
Q8. How many Eigenvalues will it have?
- The same number as the original variables: 200.
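The answers above can be verified on random data of exactly this size (a numpy sketch; names and data are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 200))           # 1,000 observations, 200 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, V = np.linalg.eigh(Z.T @ Z / (Z.shape[0] - 1))
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Q1/Q2: PCA still yields 200 PCs, same shape as the input
PC = Z @ V                                 # shape (1000, 200)

# Q3: reduce by keeping only the leftmost q columns, e.g. q = 10
PC_reduced = PC[:, :10]

# Q5-Q8: the first column carries the largest eigenvalue (most variance);
# there are 200 eigenvalues in total, one per component
```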
Algorithm Summary
Input:
A dataset with \(n\) observations and \(p\) features
\[ X_{n\times p} \]
Primary Output:
A dataset with \(n\) observations and \(q\) new features (\(q \leq p\))
\[ PC_{n \times q} \]
Extra output:
- Explained variance (eigenvalues) of each principal component
\[ \lambda_{1,2,3,...,q} \]
- Factor loadings for the \(q\) components
\[ L_{p \times q} \]
Factor loadings: how the new \(q\) principal components are constructed from the original variables.
Step by Step in Code
Steps
- Standardize the data with \(p\) variables: \(X_{n \times p} \rightarrow Z_{n \times p}\)
- Compute the Gram matrix (or covariance matrix): \(G_{p \times p} = \frac{1}{n-1} Z^T Z\).
- Perform Eigenvalue Decomposition on \(G\): \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]
- Compute the principal components by projecting \(Z\) onto \(V\):
- Variance Explained
- Factor Loadings
Notes
ML packages usually handle all of these steps at once and display the output summary nicely.
The code is to demonstrate the steps performed behind the scene.
For simplicity, I’m intentionally creating two variables that are highly correlated.
- If two variables are (highly) correlated, the information gained from the second one is small
Sample data
Create Y as a linearly correlated variable. This is to show how PCA captures the maximum variance when reducing dimensions.
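A minimal sketch of such sample data (a numpy illustration with my own names; the course code is in R):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
# y is (almost) a linear function of x, so the two carry redundant information
y = 2.0 * x + rng.normal(scale=0.3, size=n)

data = np.column_stack([x, y])
corr = np.corrcoef(x, y)[0, 1]   # close to 1 by construction
```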
Step 1. Standardize
Z-score standardization:
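The z-score step can be sketched as follows (numpy illustration; the data setup is repeated so the snippet runs standalone):

```python
import numpy as np

# Setup repeated so the snippet runs on its own
rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])

# z-score: subtract the column mean, divide by the column standard deviation
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
```

After this step every column has mean 0 and standard deviation 1.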
Step 2. Gram Matrix
\[G_{p \times p} = \frac{1}{n-1} Z^T Z\]
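This step can be sketched as (numpy illustration, setup repeated for a standalone snippet):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# G = Z^T Z / (n - 1); for z-scored data this is the correlation matrix
G = Z.T @ Z / (Z.shape[0] - 1)
```

Because the columns are z-scores, the diagonal of `G` is exactly 1 and the off-diagonal entry is the sample correlation of the two variables.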
Step 3. Eigen Decomposition
Eigen Value Decomposition: \[G_{p \times p} = V_{p \times p} \Lambda_{p \times p} V^T_{p \times p}\]
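A numpy sketch of the decomposition (illustration, setup repeated):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
G = Z.T @ Z / (Z.shape[0] - 1)

# G = V Lambda V^T; eigh returns ascending order, so flip to descending
eigvals, V = np.linalg.eigh(G)
eigvals, V = eigvals[::-1], V[:, ::-1]
```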
Step 4. Generate Principal Components
\[PC_{n \times p} = ZV \]
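The projection is a single matrix product (numpy illustration, setup repeated):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
G = Z.T @ Z / (Z.shape[0] - 1)
eigvals, V = np.linalg.eigh(G)
eigvals, V = eigvals[::-1], V[:, ::-1]

# Project the standardized data onto the new axes: PC = Z V
PC = Z @ V
```

The resulting columns are uncorrelated, and their sample variances equal the eigenvalues.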
Variance Explained
Eigenvalues represent the variance captured by each principal component. Calculate the proportion of variance explained:
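As a sketch (numpy illustration, setup repeated): with two highly correlated variables, PC1 should explain the vast majority of the variance.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
eigvals = np.sort(np.linalg.eigvalsh(Z.T @ Z / (Z.shape[0] - 1)))[::-1]

# Proportion of variance explained by each component
var_explained = eigvals / eigvals.sum()
```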
Factor Loadings
Factor loadings help us understand how the original variables contribute to the principal components. \[L = V \sqrt{\Lambda}\]
         [,1]
[1,] 1.365999
[2,] 1.414214
- X and Y contribute almost equally to PC1
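The loading computation can be sketched in numpy (an illustration; the slide output above comes from R). With two equally correlated z-scored variables, the first eigenvector is \((1,1)/\sqrt{2}\), so both variables load on PC1 with equal magnitude.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])
Z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
G = Z.T @ Z / (Z.shape[0] - 1)
eigvals, V = np.linalg.eigh(G)
eigvals, V = eigvals[::-1], V[:, ::-1]

# L = V sqrt(Lambda): column j of V scaled by sqrt(lambda_j)
loadings = V * np.sqrt(eigvals)
```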
Visual Summary
For 2-dimensional data:
PC1 captures the maximum variability of the data
PC2 captures the remaining variability
The more correlated the variables, the more effective the dimensionality reduction
Visual Summary
Variance Explained by Original
Variance Explained by PCs
Lab Walkthrough
H2O
In this step, we perform Principal Component Analysis (PCA) using the H2O framework.
First, the dataset is converted to an H2O frame while excluding the non-numeric columns (Country, Abbrev). The PCA is performed using the h2o.prcomp() function:
- k = 4: specifies the number of principal components to compute.
- transform = "STANDARDIZE": standardizes (centers and scales) all variables before applying PCA.
Country risk data
Real GDP growth (from IMF): the higher, the better
Corruption Index (Transparency International): the higher, the better (less corruption)
Peace Index (Institute for Economics and Peace): the lower, the better (more peaceful)
Legal risk index (Property Rights Association): the higher, the better (more favorable)
Browse data
# A tibble: 6 × 6
country abbrev corruption peace legal gdp_growth
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Albania AL 35 1.82 4.55 2.98
2 Algeria DZ 35 2.22 4.43 2.55
3 Argentina AR 45 1.99 5.09 -3.06
4 Armenia AM 42 2.29 4.81 6
5 Australia AU 77 1.42 8.36 1.71
6 Austria AT 77 1.29 8.09 1.60
Variable Correlations
Initiate H2O
Initiate H2O for ML
Build PCA model
# Convert data to H2O frame, removing non-numeric columns
country_risk_h2o <- as.h2o(
  country_risk
)

# Build PCA model
pca_h2o <- h2o.prcomp(
  training_frame = country_risk_h2o,
  x = c("corruption", "peace", "legal", "gdp_growth"),
  k = 4,                    # number of principal components (in this case, p = q)
  transform = "STANDARDIZE" # center & scale data
)

Calculate Principal Components
To generate PCs, simply make predictions with the PCA model.
Variance Explained
The model summary provides details of variance explained.
Model Details:
==============
H2ODimReductionModel: pca
Model ID: PCA_model_R_1771945215033_166
Importance of components:
pc1 pc2 pc3 pc4
Standard deviation 1.600254 1.001183 0.614453 0.243450
Proportion of Variance 0.640203 0.250592 0.094388 0.014817
Cumulative Proportion 0.640203 0.890795 0.985183 1.000000
H2ODimReductionMetrics: pca
No model metrics available for PCA
Factor Loadings
To see the factor loadings for each PC, we need to pull the eigenvalues and eigenvectors.
Homework Exercise
mtcars data
With mtcars data,
Perform PCA and report explained variances of each principal component.
Reduce dimensions into 2 using PCA.
Report Principal Component dataframe with 2 columns
How many principal components are needed to explain at least 95% of the variation?
Homework Reading
John C. Hull “Machine Learning in Business”
- Chapter 2